agreement measure
Ensemble Clustering using Semidefinite Programming
We consider the ensemble clustering problem where the task is to'aggregate' multiple clustering solutions into a single consolidated clustering that maximizes the shared information among given clustering solutions. We obtain several new results for this problem. First, we note that the notion of agreement under such circumstances can be better captured using an agreement measure based on a 2D string encoding rather than voting strategy based methods proposed in literature. Using this generalization, we first derive a nonlinear optimization model to max- imize the new agreement measure. We then show that our optimization problem can be transformed into a strict 0-1 Semidefinite Program (SDP) via novel con- vexification techniques which can subsequently be relaxed to a polynomial time solvable SDP.
Analyzing Dataset Annotation Quality Management in the Wild
Klie, Jan-Christoph, de Castilho, Richard Eckart, Gurevych, Iryna
Data quality is crucial for training accurate, unbiased, and trustworthy machine learning models as well as for their correct evaluation. Recent works, however, have shown that even popular datasets used to train and evaluate state-of-the-art models contain a non-negligible amount of erroneous annotations, biases, or artifacts. While practices and guidelines regarding dataset creation projects exist, to our knowledge, large-scale analysis has yet to be performed on how quality management is conducted when creating natural language datasets and whether these recommendations are followed. Therefore, we first survey and summarize recommended quality management practices for dataset creation as described in the literature and provide suggestions for applying them. Then, we compile a corpus of 591 scientific publications introducing text datasets and annotate it for quality-related aspects, such as annotator management, agreement, adjudication, or data validation. Using these annotations, we then analyze how quality management is conducted in practice. A majority of the annotated publications apply good or excellent quality management. However, we deem the effort of 30\% of the works as only subpar. Our analysis also shows common errors, especially when using inter-annotator agreement and computing annotation error rates.
A General Clustering Agreement Index: For Comparing Disjoint and Overlapping Clusters
Rabbany, Reihaneh (University of Alberta) | Zaïane, Osmar R. (University of Alberta)
A clustering agreement index quantifies the similarity between two given clusterings. It is most commonly used to compare the results obtained from different clustering algorithms against the ground-truth clustering in the benchmark datasets. In this paper, we present a general Clustering Agreement Index (CAI) for comparing disjoint and overlapping clusterings. CAI is generic and introduces a family of clustering agreement indexes. In particular, the two widely used indexes of Adjusted Rand Index (ARI), and Normalized Mutual Information (NMI), are special cases of the CAI. Our index, therefore, provides overlapping extensions for both these commonly used indexes, whereas their original formulations are only defined for disjoint cases. Lastly, unlike previous indexes, CAI is flexible and can be adapted to incorporate the structure of the data, which is important when comparing clusters in networks, a.k.a communities.
Ensemble Clustering using Semidefinite Programming
Singh, Vikas, Mukherjee, Lopamudra, Peng, Jiming, Xu, Jinhui
We consider the ensemble clustering problem where the task is to'aggregate' multiple clustering solutions into a single consolidated clustering that maximizes the shared information among given clustering solutions. We obtain several new results for this problem. First, we note that the notion of agreement under such circumstances can be better captured using an agreement measure based on a 2D string encoding rather than voting strategy based methods proposed in literature. Using this generalization, we first derive a nonlinear optimization model to maximize thenew agreement measure. We then show that our optimization problem can be transformed into a strict 0-1 Semidefinite Program (SDP) via novel convexification techniqueswhich can subsequently be relaxed to a polynomial time solvable SDP. Our experiments indicate improvements not only in terms of the proposed agreement measure but also the existing agreement measures based on voting strategies. We discuss evaluations on clustering and image segmentation databases.